7 research outputs found

    Supporting Account-based Queries for Archived Instagram Posts

    Social media has become one of the primary modes of communication, with popular platforms such as Facebook, Twitter, and Instagram leading the way. Despite its popularity, Instagram has received less attention in academic research than Facebook and Twitter, and its significant role in contemporary society is often overlooked. Web archives are making efforts to preserve social media content despite the challenges posed by the dynamic nature of these sites. The goal of our research is to facilitate the easy discovery of archived copies, or mementos, of all posts belonging to a specific Instagram account in web archives. We propose two approaches to support account-based queries for archived Instagram posts. The first approach uses existing technologies in the Internet Archive, employing WARC revisit records to incorporate Instagram usernames into the WARC-Target-URI field of the WARC file header. The second approach builds an external index that maps Instagram user accounts to their posts. A user can query this index to retrieve all post URLs for a particular account and then query web archives for each individual post. We demonstrate the implementation of both approaches and discuss their advantages and disadvantages. This research will enable web archivists to make informed decisions on which approach to adopt based on practicality and the unique requirements of their archives.
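
    As a rough illustration of the second (external index) approach, the sketch below assumes a JSON file that maps Instagram usernames to lists of post URLs and uses the Internet Archive's public CDX API to look up mementos for each post. The index layout, file name, and example account are assumptions made for illustration; this is not the implementation described above.

    import json
    import requests

    CDX_API = "https://web.archive.org/cdx/search/cdx"  # Internet Archive CDX endpoint

    def load_index(path):
        # Load a username -> [post URL, ...] mapping (assumed JSON layout).
        with open(path) as f:
            return json.load(f)

    def mementos_for_post(post_url):
        # Query the CDX API for archived captures of a single post URL.
        params = {"url": post_url, "output": "json", "fl": "timestamp,original"}
        resp = requests.get(CDX_API, params=params, timeout=30)
        rows = resp.json() if resp.text.strip() else []
        return rows[1:] if rows else []  # first row of the JSON output is a header

    def mementos_for_account(index, username):
        # Gather mementos for every post URL recorded for an account.
        return {url: mementos_for_post(url) for url in index.get(username, [])}

    index = load_index("instagram_account_index.json")   # hypothetical index file
    for url, captures in mementos_for_account(index, "exampleuser").items():
        print(url, len(captures), "mementos")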

    MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations

    Metadata quality is crucial for digital objects to be discovered through digital library interfaces. Although DL systems have adopted Dublin Core to standardize metadata formats (e.g., ETD-MS v1.1), the metadata of digital objects may contain incomplete, inconsistent, and incorrect values [1]. Most existing frameworks for improving metadata quality rely on crowdsourced correction approaches, e.g., [2]. Such methods are usually slow and biased toward documents that are more discoverable by users. Artificial intelligence (AI) based methods can be adopted to overcome this limitation by automatically detecting, correcting, and canonicalizing metadata, offering fast and unbiased treatment of document metadata. This paper uses Electronic Theses and Dissertations (ETDs) metadata as a case study and proposes an AI-based framework to improve metadata quality. ETDs represent the scholarly work of students in higher education, submitted in partial fulfillment of the requirements for a degree, and are usually hosted by university libraries or ProQuest. Using web crawling techniques, we collected metadata and full text of 533,047 ETDs from 114 American universities. Upon inspecting the metadata of these ETDs, we noticed that many ETD records contain incomplete, inconsistent, or incorrect metadata. We propose MetaEnhance, a framework that utilizes state-of-the-art AI methods to improve the quality of seven key metadata fields: title, author, university, year, degree, advisor, and department. To evaluate MetaEnhance, we compiled a benchmark containing 500 ETDs by combining subsets sampled using different criteria. We evaluated MetaEnhance against this benchmark and found that the proposed methods achieved remarkable performance in detecting and correcting metadata errors.
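
    As a rough sketch of automated error detection on these seven fields, the snippet below applies simple validity checks as stand-ins for the AI-based detectors described above. The rules and the example record are hypothetical and only illustrate the kind of per-field flags such a framework produces.

    import re

    FIELDS = ["title", "author", "university", "year", "degree", "advisor", "department"]

    def detect_errors(record):
        # Flag incomplete or obviously invalid values in an ETD metadata record.
        # These heuristics are illustrative placeholders, not MetaEnhance's models.
        errors = {}
        for field in FIELDS:
            value = (record.get(field) or "").strip()
            if not value:
                errors[field] = "missing"
            elif field == "year" and not re.fullmatch(r"(19|20)\d{2}", value):
                errors[field] = "not a plausible four-digit year"
            elif field == "author" and len(value.split()) < 2 and "," not in value:
                errors[field] = "possibly incomplete name"
        return errors

    record = {"title": "A Study of X", "author": "Doe", "university": "Old Dominion University",
              "year": "20019", "degree": "PhD", "advisor": "", "department": "Computer Science"}
    print(detect_errors(record))  # flags author, year, and advisor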

    Robots Still Outnumber Humans in Web Archives in 2019, But Less Than in 2012

    To identify robots and human users in web archives, we conducted a study using the access logs from the Internet Archive's (IA) Wayback Machine in 2012 (IA2012), 2015 (IA2015), and 2019 (IA2019), and the Portuguese Web Archive (PT) in 2019 (PT2019). We identified user sessions in the access logs and classified them as human or robot based on their browsing behavior. In 2013, AlNoamany et al. [1] studied user access patterns using IA access logs from 2012. They established four web archive user access patterns: single-page access (Dip), access to the same page at multiple archive times (Dive), access to distinct web archive pages at about the same archive time (Slide), and access to a list of archived pages (TimeMaps) for a certain URL (Skim). They also determined that in the 2012 IA access logs, robots outnumbered humans 10:1 in terms of sessions and 5:4 in terms of raw HTTP accesses. We extended their work by comparing detected robots vs. humans, their access patterns, and their temporal preferences across the two archives (IA vs. PT) and across three years of IA access logs (IA2012, IA2015, IA2019). The proportion of robot requests detected in IA2012 (91%) and IA2015 (88%) is greater than in IA2019 (70%), and robots account for 98% of requests in PT2019. We found that robots are almost entirely limited to the Dip and Skim access patterns in IA2012 and IA2015, but exhibit all the patterns and their combinations in IA2019. We also investigated the temporal preferences of users and discovered that both humans and robots favor recently archived web pages. [1] AlNoamany, Y., Weigle, M.C., Nelson, M.L.: Access patterns for robots and humans in web archives. In: JCDL '13: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 339–348 (2013), https://dl.acm.org/doi/10.1145/2467696.2467722
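
    A minimal sketch of the session identification and pattern labeling described above follows. The 30-minute session gap, the log tuple layout, and the simplified rules for Dip, Dive, and Slide are assumptions for illustration; the study's actual classifier also considers behavior (e.g., TimeMap requests for Skim) that this sketch does not model.

    from collections import defaultdict
    from datetime import timedelta

    SESSION_GAP = timedelta(minutes=30)  # assumed session timeout

    def sessionize(requests):
        # requests: iterable of (client_ip, timestamp, urir, memento_datetime) tuples.
        by_client = defaultdict(list)
        for ip, ts, urir, mdt in sorted(requests, key=lambda r: (r[0], r[1])):
            by_client[ip].append((ts, urir, mdt))
        sessions = []
        for ip, reqs in by_client.items():
            current = [reqs[0]]
            for prev, cur in zip(reqs, reqs[1:]):
                if cur[0] - prev[0] > SESSION_GAP:
                    sessions.append((ip, current))
                    current = []
                current.append(cur)
            sessions.append((ip, current))
        return sessions

    def label_pattern(session):
        # Very rough stand-ins for the Dip/Dive/Slide patterns; Skim detection
        # would require knowing which requests were TimeMap lookups.
        urirs = {r[1] for r in session}
        mdts = {r[2] for r in session}
        if len(session) == 1:
            return "Dip"    # single-page access
        if len(urirs) == 1:
            return "Dive"   # same page at multiple archive times
        if len(mdts) == 1:
            return "Slide"  # distinct pages at about the same archive time
        return "other/combination"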

    Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

    Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important for building scalable digital library search engines. Most existing methods are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods rely mainly on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human-validated metadata. Our experiments show that the CRF with visual features outperformed both a heuristic baseline and a CRF model with only text-based features. The proposed model achieved F1 scores of 81.3%-96% on seven metadata fields. The data and source code are publicly available on Google Drive (https://tinyurl.com/y8kxzwrp) and in a GitHub repository (https://github.com/lamps-lab/ETDMiner/tree/master/etd_crf), respectively. Comment: 7 pages, 4 figures, 1 table. Accepted by JCDL '21 as a short paper.
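
    The combination of text and layout features can be sketched with sklearn-crfsuite, as below. The feature names, label set, and hyperparameters are illustrative assumptions, not the paper's released code (which is in the GitHub repository linked above).

    import sklearn_crfsuite

    def line_features(line):
        # line: dict with 'text', 'font_size', 'is_bold', 'y_position'
        # (assumed output of OCR plus layout analysis of a cover page).
        text = line["text"]
        return {
            "lower": text.lower(),
            "is_upper": text.isupper(),
            "num_tokens": len(text.split()),
            "contains_digit": any(c.isdigit() for c in text),
            # visual / layout features
            "font_size": line["font_size"],
            "is_bold": line["is_bold"],
            "rel_y": round(line["y_position"], 2),
        }

    def train_crf(pages, labels):
        # pages: list of cover pages, each a list of line dicts;
        # labels: for each page, a list of per-line tags such as
        # "title", "author", "university", "other".
        X = [[line_features(line) for line in page] for page in pages]
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
        crf.fit(X, labels)
        return crf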

    MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries

    Metadata quality is crucial for digital objects to be discovered through digital library interfaces. However, for various reasons, the metadata of digital objects often contains incomplete, inconsistent, and incorrect values. We investigate methods to automatically detect, correct, and canonicalize scholarly metadata, using seven key fields of electronic theses and dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that utilizes state-of-the-art artificial intelligence methods to improve the quality of these fields. To evaluate MetaEnhance, we compiled a metadata quality evaluation benchmark containing 500 ETDs by combining subsets sampled using multiple criteria. We tested MetaEnhance on this benchmark and found that the proposed methods achieved nearly perfect F1-scores in detecting errors and F1-scores ranging from 0.85 to 1.00 in correcting errors for five of the seven fields. Comment: 7 pages, 3 tables, and 1 figure. Accepted by 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL '23) as a short paper.
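
    The per-field evaluation described above can be sketched as follows; the record layout (a boolean error flag per field) is an assumption for illustration, not the benchmark's actual format.

    from sklearn.metrics import f1_score

    FIELDS = ["title", "author", "university", "year", "degree", "advisor", "department"]

    def per_field_f1(gold, pred):
        # gold/pred: parallel lists of dicts mapping each field to True if its
        # value is erroneous, for the same set of benchmark ETDs.
        return {f: f1_score([g[f] for g in gold], [p[f] for p in pred]) for f in FIELDS}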

    Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists

    As web archives' holdings grow, archivists subdivide them into collections so they are easier to understand and manage. In this work, we review the collection structures of eight web archive platforms: Archive-It, Conifer, the Croatian Web Archive (HAW), the Internet Archive's user account web archives, the Library of Congress (LC), PANDORA, Trove, and the UK Web Archive (UKWA). We note a plethora of different approaches to web archive collection structures. Some web archive collections support sub-collections and some permit embargoes. Curatorial decisions may be attributed to a single organization or to many. Archived web pages are known by many names: mementos, copies, captures, or snapshots. Some platforms restrict a memento to a single collection and others allow mementos to cross collections. Knowledge of collection structures has implications for many different applications and users. Visitors will need to understand how to navigate collections. Future archivists will need to understand what options are available for designing collections. Platform designers need to know what possibilities exist. The developers of tools that consume collections need to understand collection structures so they can meet the needs of their users. Comment: 5 figures, 16 pages, accepted for publication at TPDL 202
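
    One way to picture the concepts surveyed above is a small, hypothetical data model with sub-collections, optional embargoes, and mementos that may be shared across collections. This is not the schema of any of the reviewed platforms.

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List, Optional

    @dataclass
    class Memento:
        urim: str                   # URI of the archived page (memento)
        urir: str                   # URI of the original resource
        memento_datetime: datetime

    @dataclass
    class Collection:
        name: str
        curators: List[str] = field(default_factory=list)      # one organization or many
        embargo_until: Optional[datetime] = None                # some platforms permit embargoes
        mementos: List[Memento] = field(default_factory=list)   # may also appear in other collections
        subcollections: List["Collection"] = field(default_factory=list)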

    The DSA Toolkit Shines Light Into Dark and Stormy Archives

    Web archive collections are created with a particular purpose in mind. A curator selects seeds, or original resources, which are then captured by an archiving system and stored as archived web pages, or mementos. The systems that build web archive collections are often configured to revisit the same original resource multiple times. This is incredibly useful for understanding an unfolding news story or the evolution of an organization. Unfortunately, over time, some of these original resources can go off-topic and no longer suit the purpose for which the collection was originally created. They can go off-topic due to web site redesigns, changes in domain ownership, financial issues, hacking, technical problems, or because their content has moved on from the original topic. Even though they are off-topic, the archiving system will still capture them, so it becomes imperative for anyone performing research on these collections to identify these off-topic mementos. Hence, we present the Off-Topic Memento Toolkit, which allows users to detect off-topic mementos within web archive collections. The mementos identified by this toolkit can then be removed from a collection or merely excluded from downstream analysis. The following similarity measures are available: byte count, word count, cosine similarity, Jaccard distance, Sørensen-Dice distance, Simhash using raw text content, Simhash using term frequency, and Latent Semantic Indexing via the gensim library. We document the implementation of each of these similarity measures. Using a gold standard dataset generated by manual analysis, which contains both off-topic and on-topic mementos, we establish a default threshold corresponding to the best F1 score for each measure. We also provide an overview of potential future directions the toolkit may take.
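
    As an illustration of similarity-based off-topic detection, the sketch below compares each memento's extracted text against the first (assumed on-topic) memento of its seed using Jaccard distance, one of the measures listed above, and flags mementos whose distance exceeds a threshold. The threshold value here is a placeholder, not the toolkit's tuned default.

    def jaccard_distance(text_a, text_b):
        a, b = set(text_a.lower().split()), set(text_b.lower().split())
        if not a and not b:
            return 0.0
        return 1.0 - len(a & b) / len(a | b)

    def flag_off_topic(memento_texts, threshold=0.90):
        # memento_texts: plain-text contents of one seed's mementos,
        # ordered by memento-datetime; the first is treated as on-topic.
        baseline = memento_texts[0]
        return [jaccard_distance(baseline, text) > threshold for text in memento_texts[1:]]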